programmatic policy
- North America > United States > California (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Health & Medicine (1.00)
- Education (1.00)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (8 more...)
- Education (0.46)
- Leisure & Entertainment > Sports (0.46)
Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies
Hu, Qinglong, Tong, Xialiang, Yuan, Mingxuan, Liu, Fei, Lu, Zhichao, Zhang, Qingfu
Deep reinforcement learning has achieved impressive success in control tasks. However, its policies, represented as opaque neural networks, are often difficult for humans to understand, verify, and debug, which undermines trust and hinders real-world deployment. This work addresses this challenge by introducing a novel approach for programmatic control policy discovery, called Multimodal Large Language Model-assisted Evolutionary Search (MLES). MLES utilizes multimodal large language models as programmatic policy generators, combining them with evolutionary search to automate policy generation. It integrates visual feedback-driven behavior analysis within the policy generation process to identify failure patterns and guide targeted improvements, thereby enhancing policy discovery efficiency and producing adaptable, human-aligned policies. Experimental results demonstrate that MLES achieves performance comparable to Proximal Policy Optimization (PPO) across two standard control tasks while providing transparent control logic and traceable design processes. This approach also overcomes the limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and is scalable across various tasks, showing promise as a new paradigm for developing transparent and verifiable control policies.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Hong Kong (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Florida > Pinellas County > St. Petersburg (0.04)
- (4 more...)
- Education (0.46)
- Leisure & Entertainment > Sports (0.46)
- North America > United States > California (0.86)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Health & Medicine (1.00)
- Education > Educational Setting > Higher Education (0.40)
Common Benchmarks Undervalue the Generalization Power of Programmatic Policies
Rajabpour, Amirhossein, Aghakasiri, Kiarash, Zilles, Sandra, Lelis, Levi H. S.
Algorithms for learning programmatic representations for sequential decision-making problems are often evaluated on out-of-distribution (OOD) problems, with the common conclusion that programmatic policies generalize better than neural policies on OOD problems. In this position paper, we argue that commonly used benchmarks undervalue the generalization capabilities of programmatic representations. We analyze the experiments of four papers from the literature and show that neural policies, which were shown not to generalize, can generalize as effectively as programmatic policies on OOD problems. This is achieved with simple changes in the neural policies training pipeline. Namely, we show that simpler neural architectures with the same type of sparse observation used with programmatic policies can help attain OOD generalization. Another modification we have shown to be effective is the use of reward functions that allow for safer policies (e.g., agents that drive slowly can generalize better). Also, we argue for creating benchmark problems highlighting concepts needed for OOD generalization that may challenge neural policies but align with programmatic representations, such as tasks requiring algorithmic constructs like stacks.
- North America > Canada > Alberta (0.14)
- Asia > Middle East > Jordan (0.04)
InnateCoder: Learning Programmatic Options with Foundation Models
Moraes, Rubens O., Sadmine, Quazi Asif, Baier, Hendrik, Lelis, Levi H. S.
Outside of transfer learning settings, reinforcement learning agents start their learning process from a clean slate. As a result, such agents have to go through a slow process to learn even the most obvious skills required to solve a problem. In this paper, we present InnateCoder, a system that leverages human knowledge encoded in foundation models to provide programmatic policies that encode "innate skills" in the form of temporally extended actions, or options. In contrast to existing approaches to learning options, InnateCoder learns them from the general human knowledge encoded in foundation models in a zero-shot setting, and not from the knowledge the agent gains by interacting with the environment. Then, InnateCoder searches for a programmatic policy by combining the programs encoding these options into larger and more complex programs. We hypothesized that InnateCoder's way of learning and using options could improve the sampling efficiency of current methods for learning programmatic policies. Empirical results in MicroRTS and Karel the Robot support our hypothesis, since they show that InnateCoder is more sample efficient than versions of the system that do not use options or learn them from experience.
- North America > Canada > Alberta (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- (4 more...)
- Leisure & Entertainment > Games (0.68)
- Education (0.45)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Reviews: Imitation-Projected Programmatic Reinforcement Learning
This paper addresses the problem of learning programmatic policies, which are structured in programmatic classes such as programming languages or regression trees. To this end, the paper proposes a "lift-and-project" framework (IPPG) that alternatively (1) optimizes a policy parameterized by a neural network in an unconstrained policy space and (2) projects the learned knowledge to space where the desired policy is constrained with a programmatic representation. Specifically, (1) is achieved by using deep policy gradient methods (e.g. DDPG, TRPO, etc.) and (2) is obtained by synthesizing programs to describe behaviors (program synthesis via imitation learning). The experiments on TORCS (a simulated car racing environment) show that the learned programmatic policies outperform the methods that imitate or distill a pre-trained neural policy and DDPG.
Programmatic Reinforcement Learning: Navigating Gridworlds
Shabadi, Guruprerana, Fijalkow, Nathanaël, Matricon, Théo
The field of reinforcement learning (RL) is concerned with algorithms for learning optimal policies in unknown stochastic environments. Programmatic RL studies representations of policies as programs, meaning involving higher order constructs such as control loops. Despite attracting a lot of attention at the intersection of the machine learning and formal methods communities, very little is known on the theoretical front about programmatic RL: what are good classes of programmatic policies? How large are optimal programmatic policies? How can we learn them? The goal of this paper is to give first answers to these questions, initiating a theoretical study of programmatic RL. Considering a class of gridworld environments, we define a class of programmatic policies. Our main contributions are to place upper bounds on the size of optimal programmatic policies, and to construct an algorithm for synthesizing them. These theoretical findings are complemented by a prototype implementation of the algorithm.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Pennsylvania (0.04)
- Europe > Poland > Masovia Province > Warsaw (0.04)
- Europe > France > Nouvelle-Aquitaine > Gironde > Bordeaux (0.04)